-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Preserve Field throughout Variant shredding flow
#8673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| pub fn with_column(mut self, field_name: &str, array: ArrayRef, nullable: bool) -> Self { | ||
| let field = Field::new(field_name, array.data_type().clone(), nullable); | ||
| self.fields.push(Arc::new(field)); | ||
| self.arrays.push(array); | ||
| self | ||
| } | ||
|
|
||
| pub fn with_field(mut self, field: FieldRef, array: ArrayRef) -> Self { | ||
| self.fields.push(field); | ||
| self.arrays.push(array); | ||
| self | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the naming here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a convenience method, right?
We already have the datatype (from the array), so maybe we just need pass an optional metadata?, making this similar to the Field constructor? Or, if that's too disruptive for exiting callers, with_column_and_metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about with_column_name and with_field_ref
| index: usize, | ||
| ) -> Variant<'a, 'a> { | ||
| let data_type = typed_value.data_type(); | ||
| let (_typed_value_field, typed_value_column) = typed_value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the field information, we can check for extension types like: e21dc1b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a stab at this. I'm not immediately sure how to react, so I just left a couple high level comments while I stew on it more.
| .typed_value_field() | ||
| .unwrap() | ||
| .1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just glancing at this code in isolation, I would guess that .0 is the field and .1 is the array?
But that requires knowing context (ie of this PR)
If this usage will show up often, is it worth returning a newtype instead of a tuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes , please, let's return a type with real field names and documentation
| let typed_value = if let Some(typed_value_array) = typed_value.clone() { | ||
| let field_ref = Arc::new(Field::new( | ||
| "typed_value", | ||
| typed_value_array.data_type().clone(), | ||
| true, | ||
| )); | ||
| builder = builder.with_field(field_ref.clone(), typed_value_array.clone()); | ||
|
|
||
| Some((field_ref, typed_value_array)) | ||
| } else { | ||
| None | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is just Option::map (no ? to complicate things)
| let typed_value = if let Some(typed_value_array) = typed_value.clone() { | ||
| let field_ref = Arc::new(Field::new( | ||
| "typed_value", | ||
| typed_value_array.data_type().clone(), | ||
| true, | ||
| )); | ||
| builder = builder.with_field(field_ref.clone(), typed_value_array.clone()); | ||
|
|
||
| Some((field_ref, typed_value_array)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because of the dual typing (one in field and one in array), this code is technically no longer infallible -- the data types could disagree. Not sure the best way to address that issue as long as we're passing a full-blown field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this logic is duplicated with the variant array constructor above.
I think one thing I'm debating is the coupling between the |
| .typed_value_field() | ||
| .unwrap() | ||
| .1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes , please, let's return a type with real field names and documentation
| } | ||
| } | ||
|
|
||
| pub type TypedValue = (FieldRef, ArrayRef); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This combination of an Array (which only has a DataType) and a Field (which has the DataType and metadata information) has come up recently upstream as @paleolimbot is working extension types through DataFusion as well in
We are still looking for a good pattern to use.
I wonder if something like this is too crazy
/// A value that has associated Arrow field infomration
struct TypedValue<T> {
value: T, // could be ArrayRef, could be ScalarValue
field: FieldRef,
}🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably struct TypedArrayRef { value: ArrayRef, field: FieldRef } is best for now (keeping the FieldRef-is-a-type bit opaque). Using FieldRefs as types in DataFusion has mostly led to a lot of cloned metadata and heterogeneity with respect to how the name/metadata/nullability should propagate (or not) but it does seem to be the path of least resistance.
Which issue does this PR close?
FixedSizeBinary(16)shredding #8665Rationale for this change
This PR preserves the
typed_value'sFieldmetadata. This way, we can check for extension types.